Abstract

Quasi-Newton methods are widely used in practice for convex loss minimization
problems. These methods exhibit good empirical performance on a wide variety
of tasks and enjoy super-linear convergence to the optimal solution. For large-scale
learning problems, stochastic Quasi-Newton methods have recently been
proposed. However, these typically achieve only sub-linear convergence rates and
have not been shown to perform consistently well in practice, since noisy Hessian
approximations can exacerbate the effect of high-variance stochastic gradient
estimates. In this work we propose VITE, a novel stochastic Quasi-Newton algorithm
that uses an existing first-order technique to reduce this variance. Without
exploiting the specific form of the approximate Hessian, we show that VITE reaches the
optimum at a geometric rate with a constant step-size when dealing with smooth
strongly convex functions. Empirically, we demonstrate improvements over
existing stochastic Quasi-Newton and variance-reduced stochastic gradient methods.


1 Introduction 

We consider the problem of optimizing a function expressed as an expectation over a set of
data-dependent functions. Stochastic gradient descent (SGD) has become the method of choice for such
tasks as it only requires computing stochastic gradients over a small subset of datapoints [2, 18]. The
simplicity of SGD is both its greatest strength and its greatest weakness. Because it evaluates noisy
approximations of the true gradient, SGD achieves a convergence rate which is only sub-linear in the
number of steps. Two primary lines of work have been developed to deal with this randomness.
The first focuses on choosing an appropriate SGD step-size [?]. If a
decaying step-size is chosen, the variance is forced to zero asymptotically, guaranteeing convergence.
However, small steps also slow down progress and limit the rate of convergence in practice. The
step-size must be chosen carefully, which can require extensive experimentation, possibly negating the
computational speedup of SGD. The second approach is to use an improved, lower-variance estimate of
the gradient. If this estimator is chosen correctly, such that its variance goes to zero asymptotically,
convergence can be guaranteed with a constant learning rate. This scheme is used in [15, 16], where
the improved estimate of the gradient combines stochastic gradients computed at the current stage
with others used at an earlier stage. A similar approach proposed in [8, 9] combines stochastic
gradients with gradients periodically re-computed at a pivot point.

With variance reduction, first-order methods can obtain a linear convergence rate. In contrast,
second-order methods have been shown to obtain super-linear convergence. However, this requires
the computation and inversion of the Hessian matrix, which is impractical for large-scale datasets.
Approximate variants known as quasi-Newton methods have thus been developed, such as the
popular BFGS or its limited memory version known as LBFGS [?]. Quasi-Newton methods such
as BFGS do not require computing the Hessian matrix but instead construct a quadratic model of the
objective function by successive measurements of the gradient. This also yields super-linear
convergence when the quadratic model is accurate. Stochastic variants of BFGS have been proposed
(oBFGS [?]), for which stochastic gradients replace their deterministic counterparts. A regularized
version known as RES [12] achieves a sub-linear convergence rate with a decreasing step-size by
enforcing a bound on the eigenvalues of the approximate Hessian matrix. SQN [3], another related
method, also requires a decreasing step size to achieve sub-linear convergence. Although stochastic
second order methods have not been shown to achieve super-linear convergence, they empirically
outperform SGD for problems with a large condition number [12].

A clear drawback of stochastic second order methods is that, like their first-order counterparts,
they suffer from high variance in the approximation of the gradient. Additionally, this problem can
be exacerbated because the estimate of the Hessian magnifies the effect of this noise. Overall, this
can lead to such algorithms taking large steps in poor descent directions.

In this paper, we propose and analyze a stochastic variant of BFGS that uses a multi-stage scheme
similar to [8, 9] to progressively reduce the variance of the stochastic gradients. We call this method
Variance-reduced Stochastic Newton (VITE). Under standard conditions on $J_t$, we show that
variance reduction on the gradient estimate alone is sufficient for fast convergence. For smooth and
strongly convex functions, VITE reaches the optimum at a geometric rate with a constant step-size.
To our knowledge VITE is the first stochastic Quasi-Newton method with these properties.

In the following section, we briefly review the BFGS algorithm and its stochastic variants. We 
then introduce the VITE algorithm and analyze its convergence properties. Finally, we present
experimental results on real-world datasets demonstrating its superior performance over a range of 
competitors. 

2 Stochastic second order optimization 

2.1 Problem setting 

Given a dataset $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$ consisting of feature vectors
$x_i \in \mathbb{R}^d$ and targets $y_i \in [0, C]$, we consider the problem of minimizing the expected
loss $f(w) = E[f_i(w)]$. Each function $f_i(w)$ takes the form $f_i(w) = \ell(h(w, x_i), y_i)$, where
$\ell$ is a loss function and $h$ is a prediction model parametrized by $w \in \mathbb{R}^d$. The
expectation is over the set of samples and we denote $w^* = \arg\min_w f(w)$.

This optimization problem can be solved exactly for convex functions using gradient descent, where
the gradient of the loss function is expressed as $\nabla_w f(w) = E[\nabla_w f_i(w)]$. When the size of the
dataset $n$ is large, the computation of the gradient is impractical and one has to resort to stochastic
gradients. Similar to gradient descent, stochastic gradient descent updates the parameter vector $w_t$
by stepping in the opposite direction of the stochastic gradient $\nabla_w f_i(w_t)$ by an amount specified
by a step size $\eta_t$ as follows:

$w_{t+1} = w_t - \eta_t \nabla_w f_i(w_t)$. (1)

In general, a stochastic gradient can also be computed as an average over a sample of $r$ datapoints
as $\hat{\nabla} f(w_t) = r^{-1} \sum_{i=1}^{r} \nabla f_i(w_t)$. Given that the stochastic gradients are
unbiased estimates of the gradient, Robbins and Monro [?] proved convergence of SGD to $w^*$ assuming
a decreasing step-size sequence. A common choice for the step size is

a) $\eta_t = \frac{\eta_0}{t}$ or b) $\eta_t = \frac{\eta_0 T_0}{T_0 + t}$, (2)

where $\eta_0$ is a constant initial step size and $T_0$ controls the speed of decrease.
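
As a minimal sketch of the SGD update (1) with decreasing step sizes of the kind in (2), the following Python snippet runs SGD on a toy one-dimensional objective. The function names, the toy objective $f(w) = \frac{1}{n}\sum_i \frac{1}{2}(w - x_i)^2$ and the exact form $\eta_0 T_0 / (T_0 + t)$ used for schedule b) are illustrative assumptions, not part of the original method description.

```python
import random


def step_size_a(t, eta0):
    """Schedule a): eta_t = eta0 / t, defined for t >= 1."""
    return eta0 / t


def step_size_b(t, eta0, T0):
    """Schedule b) (assumed form): eta_t = eta0 * T0 / (T0 + t), equal to eta0 at t = 0."""
    return eta0 * T0 / (T0 + t)


def sgd_mean(data, eta0=1.0, steps=2000, seed=0):
    """SGD on f(w) = mean_i 0.5 * (w - x_i)^2, whose minimizer is the mean of data."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(1, steps + 1):
        x_i = rng.choice(data)                # sample one datapoint
        grad = w - x_i                        # stochastic gradient of 0.5 * (w - x_i)^2
        w -= step_size_a(t, eta0) * grad      # update (1) with schedule a)
    return w


w_hat = sgd_mean([1.0, 2.0, 3.0, 4.0, 5.0])
```

With $\eta_0 = 1$ and schedule a), the iterate is exactly the running average of the sampled datapoints, which illustrates why a decaying step size averages out the gradient noise but only at a sub-linear rate.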

Although the cost per iteration of SGD is low, it suffers from slow convergence for certain
ill-conditioned problems [12]. An alternative is to use a second order method such as Newton’s method
that estimates the curvature of the objective function and can achieve quadratic convergence. In the 
following, we review Newton’s method and its approximations known as quasi-Newton methods. 

2.2 Newton’s method and BFGS 

Newton’s method is an iterative method that minimizes the second-order Taylor expansion of $f(w)$
around $w_t$:

$f(w) \approx f(w_t) + (w - w_t)^\top \nabla_w f(w_t) + \frac{1}{2}(w - w_t)^\top H_t (w - w_t)$, (3)

where $H_t$ is the Hessian of the function $f(w)$ at $w_t$ and quantifies its curvature. Minimizing
Eq. (3) leads to the following update rule:

$w_{t+1} = w_t - \eta_t H_t^{-1} \nabla f(w_t)$, (4)

where $\eta_t$ is the step size chosen by backtracking line search.

Given that computing and inverting the Hessian matrix is an expensive operation, approximate
variants of Newton’s method have emerged, where $H_t^{-1}$ is replaced by an approximate version
$\hat{H}_t^{-1}$ selected to be positive definite and as close to $H_t^{-1}$ as possible. The most popular
member of this class of quasi-Newton methods is BFGS [?], which incrementally updates an estimate of
the inverse Hessian, denoted $J_t = \hat{H}_t^{-1}$. This estimate is computed by solving a weighted
Frobenius norm minimization subject to the secant condition:

$w_{t+1} - w_t = J_{t+1}(\nabla f(w_{t+1}) - \nabla f(w_t))$. (5)


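The solution can be obtained in closed form, leading to the following explicit expression:

$J_{t+1} = \left(I - \frac{s y^\top}{y^\top s}\right) J_t \left(I - \frac{y s^\top}{y^\top s}\right) + \frac{s s^\top}{y^\top s}$, (6)

where $s = w_{t+1} - w_t$ and $y = \nabla f(w_{t+1}) - \nabla f(w_t)$. The matrix $J_{t+1}$ in Eq. (6) is
known to be positive definite assuming that $J_0$ is initialized to be a positive definite matrix and that
$y^\top s > 0$, which holds for strongly convex $f$.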
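
The closed-form update in Eq. (6) can be checked numerically. The sketch below, assuming NumPy and a small hand-picked quadratic (both are choices made for illustration, and `bfgs_inverse_update` is a hypothetical helper name), applies one update and exposes the quantities needed to verify the secant condition (5) and positive definiteness.

```python
import numpy as np


def bfgs_inverse_update(J, s, y):
    """One BFGS update of the inverse-Hessian estimate J, following Eq. (6)."""
    ys = float(y @ s)
    if ys <= 0.0:
        raise ValueError("curvature condition y^T s > 0 is violated")
    eye = np.eye(J.shape[0])
    left = eye - np.outer(s, y) / ys
    right = eye - np.outer(y, s) / ys
    return left @ J @ right + np.outer(s, s) / ys


# For the quadratic f(w) = 0.5 * w^T H w, gradient differences satisfy y = H s.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
s = np.array([1.0, -0.5])
y = H @ s
J_new = bfgs_inverse_update(np.eye(2), s, y)

secant_residual = np.linalg.norm(J_new @ y - s)   # zero up to rounding: J_new y = s
smallest_eig = np.linalg.eigvalsh(J_new)[0]       # positive: J_new stays positive definite
```

The explicit check on $y^\top s$ mirrors the curvature condition under which the update preserves positive definiteness.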


2.3 Stochastic BFGS 


A stochastic version of BFGS (oBFGS) was proposed in [?], in which stochastic gradients are used
for both the determination of the descent direction and the approximation of the inverse Hessian.
The oBFGS approach described in Algorithm 1 uses the following update equation:

$w_{t+1} = w_t - \eta_t \hat{J}_t \hat{\nabla} f(w_t)$, (7)

where the matrix $\hat{J}_t$ and the vector $\hat{\nabla} f(w_t)$ are stochastic estimates computed as
follows. Let $A \subset \{1 \dots n\}$ and $B \subset \{1 \dots n\}$ be sets containing two independent
samples of datapoints. The variables $y$ and $\nabla f(w)$ used in Eq. (6) are replaced by sampled
variables computed as

$\hat{y} = \frac{1}{|A|} \sum_{k \in A} \left(\nabla f_k(w_{t+1}) - \nabla f_k(w_t)\right)$ and
$\hat{\nabla} f(w_t) = \nabla f_B(w_t) = \frac{1}{|B|} \sum_{k \in B} \nabla f_k(w_t)$. (8)

The estimate of the inverse Hessian then becomes

$\hat{J}_{t+1} = \left(I - \frac{s \hat{y}^\top}{\hat{y}^\top s}\right) \hat{J}_t \left(I - \frac{\hat{y} s^\top}{\hat{y}^\top s}\right) + \frac{s s^\top}{\hat{y}^\top s}$. (9)


Unlike Newton’s method, oBFGS uses a fixed step size sequence instead of a line search. A common
choice is to use a step size similar to the one used for SGD in Eq. (2).

A regularized version of oBFGS (RES) was recently proposed in [12]. RES differs from oBFGS in
the use of a regularizer to enforce a bound on the eigenvalues of $\hat{J}_t$ such that

$\gamma I \preceq \hat{J}_t \preceq \rho I = \left(\gamma + \frac{1}{\delta}\right) I$, (10)

where $\gamma$ and $\delta$ are given positive constants and the notation $A \preceq B$ means that
$B - A$ is a positive semi-definite matrix. Note that (10) also implies an upper and lower bound on
$E[\hat{J}_t]$ [12]. The update of RES is modified to incorporate an identity bias term $\gamma I$ as follows:

$w_{t+1} = w_t - \eta_t (\hat{J}_t + \gamma I) \hat{\nabla} f(w_t)$. (11)

The convergence proof derived in [12] shows that lower and upper bounds on the Hessian eigenvalues
of the sample functions are sufficient to guarantee convergence to the optimum. However, the
analysis shows that RES will converge to the optimum at a rate $O(1/t)$ and requires a decreasing
step-size. Similar results were derived in [3] for the SQN algorithm.
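
One way to read the role of the identity bias term is as a shift of the spectrum: if a symmetric curvature estimate has eigenvalues in $[0, 1/\delta]$, then adding $\gamma I$ moves them into $[\gamma, \gamma + 1/\delta]$. The snippet below is a small numerical illustration of that fact only; the matrix, the constants and the helper names are hypothetical, and it does not reproduce the full RES regularization of [12].

```python
import numpy as np


def res_direction(J, grad, gamma):
    """Regularized direction (J + gamma * I) @ grad, the form used in the RES update."""
    return (J + gamma * np.eye(J.shape[0])) @ grad


def spectrum_bounds(J, gamma):
    """Smallest and largest eigenvalue of the symmetric matrix J + gamma * I."""
    vals = np.linalg.eigvalsh(J + gamma * np.eye(J.shape[0]))
    return float(vals[0]), float(vals[-1])


gamma, delta = 0.1, 0.5
# Hypothetical estimate with eigenvalues 0.0, 1.2 and 2.0, all inside [0, 1 / delta].
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((3, 3)))
J = Q @ np.diag([0.0, 1.2, 2.0]) @ Q.T
lowest, highest = spectrum_bounds(J, gamma)   # close to gamma and gamma + 1 / delta
```

Even though this `J` is singular, the regularized matrix is bounded away from zero, so the resulting step never vanishes along any direction.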
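Algorithm 1 oBFGS
INPUTS:
    $\mathcal{D}$ : Training set of $n$ examples.
    $w_0$ : Arbitrary initial values, e.g., 0.
    $\{\eta_t\}$ : Step size sequence
OUTPUT: $w_t$
$\hat{J}_0 \leftarrow \alpha I$
for $t = 0 \dots T$ do
    Randomly pick two sets $A$ and $B$
    $\hat{\nabla} f(w_t) \leftarrow \frac{1}{|B|} \sum_{k \in B} \nabla f_k(w_t)$
    $w_{t+1} \leftarrow w_t - \eta_t \hat{J}_t \hat{\nabla} f(w_t)$
    $s \leftarrow w_{t+1} - w_t$
    $\hat{y} \leftarrow \frac{1}{|A|} \sum_{k \in A} \left(\nabla f_k(w_{t+1}) - \nabla f_k(w_t)\right)$
    $\hat{J}_{t+1} \leftarrow \left(I - \frac{s \hat{y}^\top}{\hat{y}^\top s}\right) \hat{J}_t \left(I - \frac{\hat{y} s^\top}{\hat{y}^\top s}\right) + \frac{s s^\top}{\hat{y}^\top s}$
end for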


3 The VITE algorithm


Reducing the size of the sets $A$ and $B$ used to estimate the inverse Hessian approximation and the
stochastic gradient is desirable for reasons of computational efficiency. However, doing so also
increases the variance of the update step. Here we propose a new method called VITE that explicitly
reduces this variance.


In order to simplify the analysis of VITE, we do not explicitly consider the randomness in the
matrix $J_t$. Instead, we assume that it is positive definite (which holds under weak conditions due
to the BFGS update step) and that its variance can be kept under control, for example by using the 
regularization of the RES method. 


To motivate VITE we first consider the standard oBFGS step (7), estimated with the sets $A$ and $B$.
The first and second moments satisfy

$E[J_t \nabla f_B(w_t)] = J_t E_B[\nabla f_B(w_t)]$ (12)

and

$E\|J_t \nabla f_B(w_t)\|^2 \le \|J_t\|^2 E_B\|\nabla f_B(w_t)\|^2$, (13)

respectively. For $|A|$ large enough, in order to reduce the variance of the estimate $J_t \nabla f_B(w_t)$,
it is only required to reduce the variance of $\nabla f_B(w_t)$ independently. We proceed using a technique
similar to the one proposed in [8, 9].


VITE differs from oBFGS and other stochastic Quasi-Newton methods in the use of a multi-stage
scheme as shown in Algorithm 2. In the outer loop a variable $\tilde{w}$ is introduced. We periodically
evaluate the gradient of the function at $\tilde{w}$. This pivot point is inserted in the update
equation to reduce the variance. Each inner loop runs for a random number of steps $t_j \in [1, m]$
whose distribution follows a geometric law with parameter $\beta = \sum_{t=1}^{m} (1 - \eta\mu')^{m-t}$,
where $\mu'$ is the rescaled strong convexity constant introduced in Section 4. Stochastic
gradients at $w_t$ and $\tilde{w}$ are computed and the inverse Hessian approximation is updated in each
iteration of the inner loop. $J_t$ can be updated using the same update as RES, although we found in
practice that using Eq. (9) did not affect the results significantly. The descent direction
$\nabla f_B(w_t)$ is then replaced by

$v_t = \nabla f_B(w_t) - \nabla f_B(\tilde{w}) + \tilde{v}$,

where $\tilde{v}$ denotes the gradient evaluated at the pivot point $\tilde{w}$. VITE then makes updates
of the form

$w_{t+1} = w_t - \eta J_t v_t$. (14)

Clearly, $\tilde{v} = E[\nabla f_B(\tilde{w})]$ and $E[v_t] = E[\nabla f_B(w_t)]$, so in expectation the
descent is in the same direction as Eq. (12). Following the analysis of [8], the variance of $v_t$ goes to
zero when both $\tilde{w}$ and $w_t$ converge to the same parameter $w^*$. Therefore, convergence can be
guaranteed with a constant step-size. The complexity of this approach depends on the number of epochs
$S$ and a constant $m$ limiting the number of stochastic gradients computed in a single epoch, as well as
other parameters that will be introduced in more detail in Section 4.
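
The two properties of the variance-reduced direction, unbiasedness and a variance that shrinks as the iterate approaches the pivot, can be checked numerically. The sketch below does so for a synthetic least-squares problem; the data, the helper names (`batch_grad`, `vr_gradient`) and the use of singleton batches are assumptions made for illustration.

```python
import numpy as np


def batch_grad(w, X, y, idx):
    """Average gradient of 0.5 * (x_k^T w - y_k)^2 over the rows in idx."""
    Xb = X[idx]
    return Xb.T @ (Xb @ w - y[idx]) / len(idx)


def vr_gradient(w, w_pivot, full_grad_pivot, X, y, idx):
    """Batch gradient at w, minus batch gradient at the pivot, plus the pivot's full gradient."""
    return batch_grad(w, X, y, idx) - batch_grad(w_pivot, X, y, idx) + full_grad_pivot


rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
all_idx = np.arange(n)
w_pivot = rng.standard_normal(d)
full_grad_pivot = batch_grad(w_pivot, X, y, all_idx)

# Evaluate both estimators at a point close to the pivot, for every singleton batch {i}.
w = w_pivot + 0.01 * rng.standard_normal(d)
v_all = np.array([vr_gradient(w, w_pivot, full_grad_pivot, X, y, [i]) for i in range(n)])
g_all = np.array([batch_grad(w, X, y, [i]) for i in range(n)])

mean_v = v_all.mean(axis=0)          # matches the full gradient at w (unbiasedness)
var_vr = v_all.var(axis=0).sum()     # total variance of the variance-reduced estimate
var_sgd = g_all.var(axis=0).sum()    # total variance of the plain stochastic gradient
```

At the pivot itself the two batch terms cancel exactly, so the estimate coincides with the full gradient for any batch.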
Algorithm 2 VITE
INPUTS:
    $\mathcal{D}$ : Training set of $n$ examples
    $\tilde{w}_0$ : Arbitrary initial values, e.g., 0
    $\eta$ : Constant step size
    $m$ : Arbitrary constant
OUTPUT: $\tilde{w}_S$
$J_0 \leftarrow \alpha I$
for $s = 1 \dots S$ do
    $\tilde{w} = \tilde{w}_{s-1}$
    $\tilde{v} = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\tilde{w})$
    $w_0 = \tilde{w}$
    Let $t_j \leftarrow t$ with probability $\frac{(1 - \eta\mu')^{m-t}}{\beta}$ for $t = 1, \dots, m$
    for $t = 0 \dots t_j - 1$ do
        Randomly pick independent sets $A, B \subset \{1 \dots n\}$
        $v_t = \nabla f_B(w_t) - \nabla f_B(\tilde{w}) + \tilde{v}$
        $w_{t+1} \leftarrow w_t - \eta J_t v_t$
        Update $J_{t+1}$
    end for
    $\tilde{w}_s = w_{t_j}$
end for
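
The multi-stage scheme can be sketched in a few lines of NumPy for a least-squares objective. This is an illustrative sketch under stated assumptions, not the authors' implementation: the eigenvalue clipping to $[\gamma, \rho]$ is a crude stand-in for a RES-style regularizer, `mu_prime` and all default values are hypothetical, and the data are synthetic.

```python
import numpy as np


def batch_grad(w, X, y, idx):
    """Average gradient of 0.5 * (x_k^T w - y_k)^2 over the rows in idx."""
    Xb = X[idx]
    return Xb.T @ (Xb @ w - y[idx]) / len(idx)


def update_inverse_hessian(J, s, y_hat, gamma, rho):
    """BFGS inverse update, then clip eigenvalues to [gamma, rho] (assumption of this sketch)."""
    ys = float(y_hat @ s)
    if ys > 1e-12:  # keep J unchanged when the curvature condition fails
        eye = np.eye(len(s))
        J = ((eye - np.outer(s, y_hat) / ys) @ J @ (eye - np.outer(y_hat, s) / ys)
             + np.outer(s, s) / ys)
    vals, vecs = np.linalg.eigh((J + J.T) / 2.0)
    return (vecs * np.clip(vals, gamma, rho)) @ vecs.T


def vite(X, y, eta=0.1, m=50, epochs=10, size_a=20, size_b=10,
         gamma=0.1, rho=2.0, mu_prime=0.05, seed=0):
    """Variance-reduced stochastic quasi-Newton sketch for least squares."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    everyone = np.arange(n)
    w_pivot = np.zeros(d)
    J = np.eye(d)
    # P(t_j = t) is proportional to (1 - eta * mu')^(m - t) for t = 1..m.
    lengths = np.arange(1, m + 1)
    weights = (1.0 - eta * mu_prime) ** (m - lengths)
    probs = weights / weights.sum()
    for _ in range(epochs):
        v_pivot = batch_grad(w_pivot, X, y, everyone)   # full gradient at the pivot
        w = w_pivot.copy()
        t_j = int(rng.choice(lengths, p=probs))
        for _ in range(t_j):
            A = rng.choice(n, size=size_a, replace=False)
            B = rng.choice(n, size=size_b, replace=False)
            v = batch_grad(w, X, y, B) - batch_grad(w_pivot, X, y, B) + v_pivot
            w_next = w - eta * (J @ v)
            y_hat = batch_grad(w_next, X, y, A) - batch_grad(w, X, y, A)
            J = update_inverse_hessian(J, w_next - w, y_hat, gamma, rho)
            w = w_next
        w_pivot = w
    return w_pivot


def loss(w, X, y):
    """Least-squares objective 0.5 * mean((X w - y)^2)."""
    return 0.5 * float(np.mean((X @ w - y) ** 2))


# Toy problem on synthetic data (not one of the datasets used in the experiments).
data_rng = np.random.default_rng(1)
X_demo = data_rng.standard_normal((200, 5))
y_demo = X_demo @ data_rng.standard_normal(5) + 0.1 * data_rng.standard_normal(200)
w_vite = vite(X_demo, y_demo)
w_star = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]
```

Clipping the spectrum keeps the sketch inside the regime where the inverse Hessian approximation has bounded eigenvalues, at the cost of an eigendecomposition per step, which is only reasonable for small $d$.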
4 Analysis

In this section we present a convergence proof for the VITE algorithm that builds upon and
generalizes previous analyses of variance reduced first order methods [8, 9]. Specifically, we show how
variance reduction on the stochastic gradient direction is sufficient to establish geometric
convergence rates, even when performing linear transformations with a matrix $J_t$. Since we do not exploit
the specific form of the stochastic evolution equations for $J_t$, this analysis will not allow us to argue
in favor of the specific choice of Eq. (9), yet it shows that variance reduction on the gradient estimate
is sufficient for fast convergence as long as $J_t$ is sufficiently well behaved. Our analysis relies on
the following standard assumptions:

A1 Each function $f_i$ is differentiable and has a Lipschitz continuous gradient with constant $L > 0$,
i.e. $\forall w, v \in \mathbb{R}^d$,

$f_i(w) \le f_i(v) + (w - v)^\top \nabla f_i(v) + \frac{L}{2}\|w - v\|^2$. (15)

A2 $f$ is $\mu$-strongly convex, i.e. $\forall w, v \in \mathbb{R}^d$,

$f(w) \ge f(v) + (w - v)^\top \nabla f(v) + \frac{\mu}{2}\|w - v\|^2$, (16)

which also implies

$\|\nabla f(w)\|^2 \ge 2\mu(f(w) - f(w^*)) \quad \forall w \in \mathbb{R}^d$ (17)

for the minimizer $w^*$ of $f$.
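
For completeness, the step from strong convexity to the gradient lower bound can be spelled out. The following short derivation is a standard argument, written here with an auxiliary point $u$ that does not appear in the original text.

```latex
% Strong convexity, stated at the base point w and evaluated at an arbitrary point u:
%   f(u) >= f(w) + (u - w)^T \nabla f(w) + (\mu / 2) \|u - w\|^2 .
% The right-hand side is a quadratic in u, minimized at u - w = -\nabla f(w) / \mu:
\begin{align*}
f(u) &\ge \min_{u'} \Big[ f(w) + (u' - w)^\top \nabla f(w) + \frac{\mu}{2}\|u' - w\|^2 \Big]
      = f(w) - \frac{1}{2\mu}\|\nabla f(w)\|^2 .
\end{align*}
% The bound holds for every u, in particular for u = w^*:
\begin{align*}
f(w^*) \ge f(w) - \frac{1}{2\mu}\|\nabla f(w)\|^2
\quad\Longleftrightarrow\quad
\|\nabla f(w)\|^2 \ge 2\mu\,\big(f(w) - f(w^*)\big).
\end{align*}
```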

Assumptions A1 and A2 also imply that the eigenvalues of the Hessian are bounded as follows:

$\mu I \preceq H \preceq L I$. (18)

Finally we make the assumption that the inverse Hessian approximation is always well-behaved.

A3 There exist positive constants $\gamma$ and $\rho$ such that, $\forall w \in \mathbb{R}^d$,

$\gamma I \preceq J_t \preceq \rho I$. (19)

Assumption A3 is equivalent to assuming that $J_t$ is bounded in expectation (see, e.g., [12]) but
allows us to remove this complication, simplifying notation in the analysis which follows. We now
introduce two lemmas required for the proof of convergence.
Lemma 1. The following identity holds:

$E f(\tilde{w}_{s+1}) = \frac{1}{\beta} \sum_{t=0}^{m-1} \tau_t E f(w_t)$,

where $\tau_t := (1 - \eta\mu')^{m-t-1}$ and the weight vectors $w_t$ belong to epoch $s$.

This result follows directly from Lemma 3 in [9].

Lemma 2.

$E\|v_t\|^2 \le 4L(f(w_t) - f(w^*) + f(\tilde{w}) - f(w^*))$.

The proof is given in [8, 9] and reproduced for convenience in the Appendix. We are now ready to
state our main result.


Theorem 1. Let Assumptions A1-A3 be satisfied. Define the rescaled strong convexity and Lipschitz
constants $\mu' := \gamma\mu \le \mu$ and $L' := \rho L \ge L$, respectively. Choose
$0 < \eta < \frac{\mu'}{2L'^2}$ and let $m$ be sufficiently large so that

$\alpha = \frac{(1 - \eta\mu')^m}{\beta\eta(\mu' - 2L'^2\eta)} + \frac{2L'^2\eta}{\mu' - 2L'^2\eta} < 1$.

Then the suboptimality of $\tilde{w}_s$ is bounded in expectation as follows:

$E[f(\tilde{w}_s) - f(w^*)] \le \alpha^s E[f(\tilde{w}_0) - f(w^*)]$. (20)


Remark 1. Observe that $\gamma$ and $\rho$ are bounds on the inverse Hessian approximation. If $J_t$ is a
good approximation to $H^{-1}$, then by plugging in $\gamma = L^{-1}$ and $\rho = \mu^{-1}$, the upper
bound on the learning rate reduces to $\eta < \frac{\mu^3}{2L^3}$.
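
To get a feel for the constants, the contraction factor can be evaluated numerically. The sketch below assumes the expression $\alpha = \frac{(1 - \eta\mu')^m}{\beta\eta(\mu' - 2L'^2\eta)} + \frac{2L'^2\eta}{\mu' - 2L'^2\eta}$ with $\beta = \sum_{t=1}^{m}(1 - \eta\mu')^{m-t}$, which is the form the proof of Theorem 1 arrives at; the function name and the example constants are hypothetical.

```python
def vite_rate(eta, m, gamma, rho, mu, L):
    """Return (alpha, beta, eta_max) for the assumed form of the Theorem 1 constants."""
    mu_p = gamma * mu                  # rescaled strong convexity constant mu'
    L_p = rho * L                      # rescaled Lipschitz constant L'
    eta_max = mu_p / (2.0 * L_p ** 2)  # admissible step sizes satisfy 0 < eta < eta_max
    if not 0.0 < eta < eta_max:
        raise ValueError("eta must lie strictly between 0 and mu' / (2 L'^2)")
    q = 1.0 - eta * mu_p
    beta = sum(q ** (m - t) for t in range(1, m + 1))
    gap = mu_p - 2.0 * L_p ** 2 * eta
    alpha = q ** m / (beta * eta * gap) + 2.0 * L_p ** 2 * eta / gap
    return alpha, beta, eta_max


# Idealized, perfectly conditioned example: gamma = rho = mu = L = 1.
alpha, beta, eta_max = vite_rate(eta=0.1, m=50, gamma=1.0, rho=1.0, mu=1.0, L=1.0)
```

The first term of $\alpha$ decays with $m$ while the second depends only on $\eta$, so a longer inner loop helps only up to the floor set by the step size.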


Proof of Theorem 1. Our starting point is the basic inequality

$f(w_{t+1}) = f(w_t - \eta J_t v_t) \le f(w_t) - \eta \langle \nabla f(w_t), J_t v_t \rangle + \frac{L}{2}\eta^2 \|J_t v_t\|^2$. (21)

We first use the properties of $v_t$ and $J_t$ to reduce the dependence of (21) on $J_t$ to its largest and
smallest eigenvalues given by (19). For the purpose of the analysis, we define $\mathcal{F}_t$ to be the
sigma-algebra measuring $w_t$. By conditioning on $\mathcal{F}_t$, and by A3, the remaining randomness is
in the choice of the index set $B$ in round $t$, which is tied to the stochasticity of $v_t$. Taking
expectations with respect to $B$ gives us

$E_B\|J_t v_t\|^2 \le \|J_t\|^2 E_B\|v_t\|^2 \le \rho^2 E_B\|v_t\|^2$ (22)

and

$E_B\langle \nabla f(w_t), J_t v_t \rangle = \langle \nabla f(w_t), J_t \nabla f(w_t) \rangle \ge \gamma \|\nabla f(w_t)\|^2$, (23)

where (23) comes from the definition $E_B v_t = \nabla f(w_t)$. Therefore, taking the expectation of the
inequality (21) and dropping the notational dependence on $B$ results in

$E f(w_{t+1}) \le E f(w_t) - \gamma\eta E\|\nabla f(w_t)\|^2 + \frac{L}{2}\eta^2\rho^2 E\|v_t\|^2$. (24)


To simplify the remainder of the proof we make the following substitution:

$\mu' := \gamma\mu \le \mu$ and $L' := \rho L \ge L$.

Considering a fixed epoch $s$, we can further bound $E f(w_{t+1})$ using Lemma 2 and Eq. (17). By taking
the expectation over $\mathcal{F}_t$, adding and subtracting $f(w^*)$, we get

$E[f(w_{t+1}) - f(w^*)] \le E[f(w_t) - f(w^*)] + 2\eta^2 L'^2 (f(\tilde{w}_s) - f(w^*)) + 2(\eta^2 L'^2 - \eta\mu') E[f(w_t) - f(w^*)]$
$= 2\eta^2 L'^2 (f(\tilde{w}_s) - f(w^*)) + (2\eta^2 L'^2 - 2\eta\mu' + 1) E[f(w_t) - f(w^*)]$. (25)
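Writing $\Delta f(w_t) := f(w_t) - f(w^*)$, we then have

$(\eta\mu' - 2\eta^2 L'^2) E\Delta f(w_t) \le 2\eta^2 L'^2 \Delta f(\tilde{w}_s) + (1 - \eta\mu') E\Delta f(w_t) - E\Delta f(w_{t+1})$. (26)

Now we sum all these inequalities at iterations $t = 0, \dots, m-1$ performed in epoch $s$ with weights
$\tau_t = (1 - \eta\mu')^{m-t-1}$. Applying Lemma 1 to the weighted sum on the left-hand side to recover
$f(\tilde{w}_{s+1})$, we arrive at

$\beta E\Delta f(\tilde{w}_{s+1}) \le \frac{2\beta\eta^2 L'^2}{\eta\mu' - 2\eta^2 L'^2} E\Delta f(\tilde{w}_s) + \sum_{t=0}^{m-1} \tau_t \frac{(1 - \eta\mu') E\Delta f(w_t) - E\Delta f(w_{t+1})}{\eta\mu' - 2\eta^2 L'^2}$.

We now need to bound the remaining sum in the numerator,
$(*) := \sum_{t=0}^{m-1} \tau_t \left[(1 - \eta\mu') E\Delta f(w_t) - E\Delta f(w_{t+1})\right]$, which can be
accomplished by re-grouping summands. The sum telescopes, and with $w_0 = \tilde{w}_s$ we obtain

$(*) = (1 - \eta\mu')^m E\Delta f(\tilde{w}_s) - E\Delta f(w_m)$.

By ignoring the negative term in (*), we get the final bound

$E\Delta f(\tilde{w}_{s+1}) \le \alpha E\Delta f(\tilde{w}_s)$,

where

$\alpha = \frac{(1 - \eta\mu')^m}{\beta(\eta\mu' - 2\eta^2 L'^2)} + \frac{2\eta^2 L'^2}{\eta\mu' - 2\eta^2 L'^2}$.

□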


Theorem 1 implies that VITE has a local geometric convergence rate with a constant learning rate.
In order to satisfy $E(f(\tilde{w}_s) - f(w^*)) \le \epsilon$, the number of stages $s$ needs to satisfy

$s \ge -(\log \alpha)^{-1} \log \frac{E(f(\tilde{w}_0) - f(w^*))}{\epsilon}$.

Since each stage requires $n + m(2|A| + 2|B|)$ component gradient evaluations, the overall complexity
is $O((n + 2m(|A| + |B|)) \log(1/\epsilon))$.
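
The stage count and the resulting number of component gradient evaluations are easy to compute for given constants. The helper names and the example values below are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def stages_needed(alpha, initial_gap, eps):
    """Smallest integer s such that alpha**s * initial_gap <= eps, for 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if initial_gap <= eps:
        return 0
    return math.ceil(math.log(initial_gap / eps) / math.log(1.0 / alpha))


def gradient_evaluations(stages, n, m, size_a, size_b):
    """Total cost when each stage uses n + m * (2|A| + 2|B|) component gradients."""
    return stages * (n + m * (2 * size_a + 2 * size_b))


s_min = stages_needed(alpha=0.5, initial_gap=1.0, eps=1e-3)
total_cost = gradient_evaluations(s_min, n=1000, m=50, size_a=5, size_b=10)
```

In this example the full-gradient pass ($n$ evaluations) and the inner loop ($m(2|A| + 2|B|)$ evaluations) contribute comparable amounts per stage, which is the balance the choice of $m$ controls.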


5 Experimental Results 

This section presents experimental results that compare the performance of VITE to SGD, SVRG [8],
which incorporates variance reduction, and RES [12], which incorporates second order information.
We consider two commonly occurring problems in machine learning, namely least-square regression 
and regularized logistic regression. 

Linear Least Squares Regression. We apply least-square regression on the binary version of the
Cov dataset [?] that contains $n = 581{,}012$ datapoints, each described by $d = 54$ input features.
Logistic Regression. We apply logistic regression on the Adult and Ijcnn1 datasets obtained
from the LibSVM website.¹ The Adult dataset contains $n = 32{,}561$ datapoints, each described
by $d = 123$ input features. The Ijcnn1 dataset contains $n = 49{,}990$ datapoints, each described by
$d = 22$ input features. We added an $\ell_2$-regularizer with parameter $\lambda = 10^{-5}$ to ensure the
objective is strongly convex.

The complexity of VITE depends on three quantities: the approximate Hessian $J_t$, the pair of
stochastic gradients $(\nabla f_B(w_t), \nabla f_B(\tilde{w}))$ and $\tilde{v}$, respectively computed over
the sets $A$, $B$ and $\mathcal{D}$. Similarly to [12], we consider different choices for $|A|$ and $|B|$ and
pick the best value in a limited interval $\{1, \dots, 0.05n\}$. These results are also reported for the RES
method that also depends on both $|A|$ and $|B|$. For SGD, we use $|B| = 1$ as we found this value to be
the best performer on all datasets. Computing the average gradient $\tilde{v}$ over the full dataset for
SVRG and VITE is impractical. We therefore estimate $\tilde{v}$ over a small subset
$C \subset \mathcal{D}$. Although this introduces some bias, it did not seem to practically affect
convergence for sufficiently large $|C|$. In our experiments, we selected $|C| = 0.1n$ samples uniformly
at random. Each experiment was averaged over 5 runs with different initializations of $w_0$ and a random
selection of the samples in $A$, $B$ and $C$. Given that the complexity per iteration of each method is
different, we compare them as a function of the number of gradient evaluations.

¹ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

Figure 1 (panels: (a) Cov, (b) Adult, (c) Ijcnn): The red and green curves are the losses achieved by
RES and VITE respectively for varying size of $|B|$ as a percentage of $n$. Each experiment was averaged
over 5 runs. Error bars denote variance. In the regime $|B| < 0.1\%$, VITE has a much lower variance and
reaches a lower optimum value. Increasing $|B|$ further decreases the variance of the stochastic gradients
but requires more gradient evaluations, decreasing the gap in performance between the methods. Overall,
we found VITE with $|B| = 1\%$ and $|B| = 0.1\%$ to perform the best.

Fig. 1 shows the empirical convergence properties of VITE against RES for least-square regression
and logistic regression. The horizontal axis corresponds to the number of gradient evaluations while
the vertical axis corresponds to the objective function value. The vertical bars in each plot show the
variance over 5 runs. We show plots for different values of $|B|$ and the best corresponding $|A|$. For
small $|B|$, the variance of the stochastic gradients clearly hurts RES while the variance corrections
of VITE lead to fast convergence. As we increase $|B|$, thus reducing the variance of the stochastic
gradients, the convergence rates of RES and VITE become similar. However, VITE with small $|B|$
is much faster to converge to a lower objective value. This clearly demonstrates how using small
batches for the computation of the gradients while reducing their variance leads to a fast convergence
rate. We also investigated the effect of $|A|$ on the convergence of RES and VITE (see Appendix).
In short, we find that a good-enough curvature estimate can be obtained for $|A| = O(10^{-5} n)$.
Increasing this value incurs a penalty in terms of number of gradient evaluations required and so
overall performance degrades.

Finally, we compared VITE against SGD, RES and SVRG [8, 9]. A critical factor in the
performance of SGD is the selection of the step-size. We use the step-size given in Eq. (2b) and pick
the parameters $T_0$ and $\eta_0$ by performing cross-validation over
$T_0 \in \{1, 10, 10^2, \dots, 10^4\}$ and $\eta_0 \in \{10^{-1}, \dots, 10^{-5}\}$. Although it is a
quasi-Newton method, RES also requires a decaying step-size and so the same selection process was
performed. For SVRG and VITE, we used a constant step size chosen in the same interval as $\eta_0$. For
SVRG and VITE we used the same size subset $C$ to compute $\tilde{v}$. Fig. 2 shows the objective value
of each method in log scale. Although RES and SVRG are superior to SGD, neither clearly outperforms the
other. On the other hand, we observe that VITE consistently converges faster than both RES and SVRG.
This demonstrates that the combination of second order information and variance reduction is
beneficial for fast convergence.
Figure 2 (panels: (a) Cov, (b) Adult, (c) Ijcnn): Comparison of RES and VITE (trained with the best
performing parameters) against SGD and SVRG. The reduction in variance for VITE is faster than for SGD
or RES, which typically leads to faster convergence.


6 Conclusion 

We have shown that stochastic variants of BFGS can be made more robust to the effects of noisy
stochastic gradients using variance reduction. We introduced VITE and showed that it obtains a
geometric convergence rate for smooth strongly convex functions, to our knowledge the first stochastic
Quasi-Newton algorithm with this property. We have shown experimentally that VITE outperforms
both variance reduced SGD and stochastic BFGS. The theoretical analysis we present is quite
general and additionally only requires that the bound on the eigenvalues of the inverse Hessian matrix
in (19) holds. Therefore, the variance reduced framework we propose can be extended to other
quasi-Newton methods, including the widely used L-BFGS and AdaGrad [?] algorithms. Finally,
an important open question is how to bridge the gap between the theoretical and empirical results.
Specifically, it remains open whether it is possible to obtain better convergence rates for stochastic
BFGS algorithms that match the improvement we have demonstrated over SVRG.
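7 Appendix

7.1 Proof of Lemma 2

$E\|v_t\|^2 = E\|\nabla f_i(w_t) - \nabla f_i(\tilde{w}) + \nabla f(\tilde{w})\|^2$
$\le 2E\|\nabla f_i(w_t) - \nabla f_i(w^*)\|^2 + 2E\|(\nabla f_i(\tilde{w}) - \nabla f_i(w^*)) - \nabla f(\tilde{w})\|^2$
$= 2E\|\nabla f_i(w_t) - \nabla f_i(w^*)\|^2 + 2E\|(\nabla f_i(\tilde{w}) - \nabla f_i(w^*)) - (\nabla f(\tilde{w}) - \nabla f(w^*))\|^2$
$\le 2E\|\nabla f_i(w_t) - \nabla f_i(w^*)\|^2 + 2E\|\nabla f_i(\tilde{w}) - \nabla f_i(w^*)\|^2$
$\le 4L(f(w_t) - f(w^*) + f(\tilde{w}) - f(w^*))$. (27)

The first inequality uses $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ and the equality uses $\nabla f(w^*) = 0$.
The second inequality uses $E\|\xi - E\xi\|^2 = E\|\xi\|^2 - \|E\xi\|^2 \le E\|\xi\|^2$ for any random
vector $\xi$.

The last inequality uses the following inequality derived from the fact that each $f_i$ has an
$L$-Lipschitz continuous gradient:

$E\|\nabla f_i(w^*) - \nabla f_i(w_t)\|^2 \le 2L(f(w_t) - f(w^*))$.

□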


7.2 Selection of the parameter |A|. 

Figure 3 shows the effect of the set $A$, used to estimate the inverse Hessian, on the convergence
of RES and VITE. We show results for $|A| \in \{0.00001, 0.0001\} \times n$. Firstly we see that better
performance is obtained for both methods for the smaller value of $|A|$. By increasing $|A|$, the penalty
paid in terms of gradient evaluations outweighs the gain in terms of better curvature estimates and
so convergence is slower. A similar observation was made in [12]. However, we also observe that
VITE outperforms RES for all values of $|A|$.




Figure 3 (surviving panel label: (c) Ijcnn): Evolution of the objective value of RES and VITE for
different values of $|A|$. We can see that the lowest value of $|A|$ performs better, which indicates
that there is no gain in increasing this value past a certain cut-off value.